1.) Consider a screening test for a disease that is to be administered to members of a particular target population. The event S denotes a positive screening test for a patient; S' is the complementary event, i.e. a negative screening test. The event D denotes presence of the disease in the patient, and the event D' denotes absence of the disease. The prevalence of the disease in the population, P(D), is the probability that a patient chosen at random from the population has the disease, and this probability is known from previous studies: P(D) = 0.01. The sensitivity, P(S | D), and specificity, P(S' | D'), of the screening test have been measured and are known: P(S | D) = 0.9 and P(S' | D') = 0.8.

(a) Write an expression for P(S ∩ D) in terms of sensitivity and prevalence, and evaluate numerically.

By the definition of conditional probability, P(S | D) = P(S ∩ D) / P(D), so P(S ∩ D) = P(S | D) * P(D).

sensitivity <- 0.9
prevalence <- 0.01
S_D_intersection <- sensitivity * prevalence
S_D_intersection
## [1] 0.009
(b) What is the interpretation of P(S | D')? It is the probability of a positive screen given that no disease is present. This is known as the false-positive rate.
(c) Write an expression for P(S | D') in terms of specificity, and evaluate numerically. Since S and S' are complementary events, P(S | D') = 1 − P(S' | D') = 1 − specificity.
specificity <- 0.8
false_positive_rate <- 1 - specificity
false_positive_rate
false_positive_rate
## [1] 0.2
(d) Write an expression for P(S ∩ D') in terms of specificity and prevalence, and evaluate numerically.

P(S | D') = P(S ∩ D') / P(D'), and P(D') = 1 − P(D), so P(S ∩ D') = P(S | D') * (1 − P(D)) = (1 − specificity) * (1 − prevalence).
S_Dprime_intersection <- false_positive_rate * (1 - prevalence)
S_Dprime_intersection
S_Dprime_intersection
## [1] 0.198
(e) Write an expression for P(S) in terms of sensitivity, specificity, and prevalence, and evaluate numerically. By the law of total probability, P(S) = P(S ∩ D) + P(S ∩ D') = sensitivity * prevalence + (1 − specificity) * (1 − prevalence).
positive = (sensitivity*prevalence) + (1-specificity)*(1-prevalence)
positive
## [1] 0.207
(f) What is the meaning of P(D | S), and why is it important? It is the predictive value of a positive screen. It is important because it gives the probability of disease given a positive test result, taking into account the prior probability that the patient has the disease (the prevalence in the general population).
(g) Write an expression for P(D | S) in terms of sensitivity, specificity, and prevalence, and evaluate numerically. By Bayes' theorem, P(D | S) = P(S | D) P(D) / P(S).
predictive_value <- (sensitivity * prevalence) / positive
predictive_value
predictive_value
## [1] 0.04347826
(h) Given a sensitivity of 0.9 and a prevalence of 0.01, what specificity would be required to make P(D|S) = 0.99?

Starting from

predictive_value = (sensitivity * prevalence) / ((sensitivity * prevalence) + (1 - specificity) * (1 - prevalence))

solve for specificity:

(sensitivity * prevalence) + (1 - specificity) * (1 - prevalence) = (sensitivity * prevalence) / predictive_value

(1 - specificity) * (1 - prevalence) = (sensitivity * prevalence) / predictive_value - (sensitivity * prevalence)

1 - specificity = ((sensitivity * prevalence) - (sensitivity * prevalence * predictive_value)) / (predictive_value * (1 - prevalence))

specificity = 1 - ((sensitivity * prevalence) - (sensitivity * prevalence * predictive_value)) / (predictive_value * (1 - prevalence))

predictive_value=0.99
specificity = -((sensitivity*prevalence)-(sensitivity*prevalence*predictive_value))/((predictive_value)*(1-prevalence)) + 1
specificity
## [1] 0.9999082
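As a sanity check (a small sketch reusing the quantities defined above), plugging the solved specificity back into the predictive-value formula should recover 0.99:

```r
# Recompute the predictive value using the specificity solved above
sensitivity <- 0.9
prevalence <- 0.01
specificity <- 0.9999082

p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
check_ppv <- (sensitivity * prevalence) / p_positive
round(check_ppv, 4)  # should be 0.99
```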
2.) In phototransduction in vertebrate rods, a light-activated molecule of rhodopsin, R*, acts as an enzyme that catalyzes the exchange of GDP for GTP on the α-subunit of a G-protein, Gα, to produce an activated G-protein subunit, Gα-GTP. A free (unbound) R* can either be deactivated by phosphorylation catalyzed by rhodopsin kinase, with probability p = 1/11, or it can bind Gα-GDP and catalyze the GDP-GTP exchange, with probability (1 − p) = 10/11. Every time R* catalyzes the GDP-GTP exchange it becomes free again afterwards. Let us use N to refer to the random number of Gα-GTP produced by R* before it is deactivated.
(a) What type of probability mass function does N have? Be sure to be specific by stating the parameter value. Write the equation of the probability mass function, Pr{N = n}.

N has a geometric probability mass function with parameter p = 1/11, the probability of deactivation. N counts the number of Gα-GTP produced before R* is deactivated, i.e. the number of "failures" before the first "success." Pr{N = n} = p(1 − p)^n, for n = 0, 1, 2, ...
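As a quick check (a sketch using base R's dgeom, which uses the same parameterization, counting failures before the first success), the formula can be compared against the built-in function:

```r
p <- 1/11
n <- 0:5

manual <- p * (1 - p)^n        # Pr{N = n} = p(1-p)^n
builtin <- dgeom(n, prob = p)  # base R geometric pmf

all.equal(manual, builtin)  # TRUE
```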

(b) What is the mean, μ, of N in terms of p? Evaluate numerically. Note that Table 3.4 on page 93 of the textbook has an incorrect equation for the mean of this probability mass function. μ = (1 − p)/p.
p <- 1/11
mu <- (1 - p)/p
mu
## [1] 10
(c) What is the variance, σ², of N in terms of p? Evaluate numerically. σ² = (1 − p)/p².
sigma_squared<-(1-p)/p^2
sigma_squared
## [1] 110
(d) Use R to find the probability that N ≥ 20, and state the R command. Since pgeom(q, prob) gives Pr{N ≤ q}, we want the upper tail at q = 19:
pgeom(19, prob = 1/11, lower.tail = FALSE)
## [1] 0.1486436
(e) In R, draw 100,000 random samples from a population with your probability mass function in part (a), and store in an array called X. Compute the sample mean and sample variance of X. Are these sample values close to your theoretical population values in parts (b) and (c)? Yes, the sample values are nearly the same as the theoretical population values.
n <- 100000
X <- rgeom(n, prob = p)
mean(X)
## [1] 10.06536
var(X)
## [1] 111.5837
#install.packages("plotly")
library(plotly)
library(boot)
load("midterm_diabetes.RData")

boxplot(diabetes, normal)

stripchart(diabetes)
stripchart(normal, method="stack", add=TRUE, at=1.2)

plot_ly(alpha=0.6) %>%
  add_histogram(x=diabetes, name = "diabetes") %>%
  add_histogram(x=normal, name = "normal") %>%
  layout(barmode="overlay")
par(mfrow=c(2,1))
hist(diabetes)
hist(normal)

effect_size = mean(diabetes)-mean(normal)
effect_size
## [1] 2.507339

The assumptions of a parametric test are that the sample is representative of the population, that observations are selected independently, that the response variable is normally distributed, and that the variances of the two groups are equal. The advantage is that when the data really are normally distributed the test is powerful, and t-tests are robust to small violations of the assumptions. A nonparametric test makes no assumption about the distribution of the data, beyond, for some tests, symmetry about the mean or median. The advantages are that you do not need to know much about your data to use one of these tests, they can be used for smaller datasets, they are more robust to outliers, and they can be applied to ordinal as well as continuous data. The disadvantage is that these tests have lower power than comparable parametric tests. The assumption of a permutation test is that the data from your experiment are representative of the population; if they are not, the test will faithfully reproduce incorrect results.
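One quick way to probe the normality assumption before choosing between these tests is a normal Q-Q plot or a Shapiro-Wilk test (a sketch, applied to the `diabetes` and `normal` vectors loaded above):

```r
# Visual check: points falling near the reference line suggest approximate normality
qqnorm(diabetes); qqline(diabetes)
qqnorm(normal);   qqline(normal)

# Formal check: a small p-value suggests departure from normality
shapiro.test(diabetes)
shapiro.test(normal)
```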

t.test(diabetes, normal)
## 
##  Welch Two Sample t-test
## 
## data:  diabetes and normal
## t = 4.7627, df = 143.22, p-value = 4.631e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.466719 3.547958
## sample estimates:
## mean of x mean of y 
##  6.938536  4.431197
wilcox.test(diabetes, normal)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  diabetes and normal
## W = 4018, p-value = 4.508e-06
## alternative hypothesis: true location shift is not equal to 0
Z <- c(normal, diabetes)
X <- Z[1:length(normal)]
Y <- Z[(length(normal) + 1):length(Z)]
x.bar <- mean(X)
y.bar <- mean(Y)
meanDiff <- x.bar - y.bar
n.rep <- 10000
n.extreme <- 0  # count of shuffles with a mean difference at least as extreme as observed
for (i in 1:n.rep){
  sampleZ <- sample(Z, replace = FALSE)
  X2 <- sampleZ[1:length(normal)]
  Y2 <- sampleZ[(length(normal) + 1):length(sampleZ)]
  meanDiff2 <- mean(X2) - mean(Y2)
  if (abs(meanDiff2) >= abs(meanDiff)){
    n.extreme <- n.extreme + 1
  }
}
pval <- (n.extreme + 1)/(n.rep + 1)
pval
## [1] 9.999e-05
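The permutation loop above can also be written more compactly with replicate (a sketch; same logic and the same +1 correction, reusing Z, normal, and meanDiff from above):

```r
# One permuted mean difference per call; replicate() collects 10,000 of them
perm_diff <- function() {
  s <- sample(Z, replace = FALSE)
  mean(s[1:length(normal)]) - mean(s[(length(normal) + 1):length(s)])
}
perm_dists <- replicate(10000, perm_diff())

# Two-sided permutation p-value
(sum(abs(perm_dists) >= abs(meanDiff)) + 1) / (10000 + 1)
```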
library(boot)
drosophila<-read.csv(file="q2_drosophila_body_part.tsv", sep="\t")
trimmed_mean <- function(x) {
  mean(x, trim = 0.1)
}
dro.boot <- boot(drosophila$X1.801, function(x, i) trimmed_mean(x[i]), R = 1000)
dro.boot
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = drosophila$X1.801, statistic = function(x, i) trimmed_mean(x[i]), 
##     R = 1000)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1* 2.453877 0.0112871   0.1942936
boot.ci(dro.boot, conf=0.99, type="basic")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = dro.boot, conf = 0.99, type = "basic")
## 
## Intervals : 
## Level      Basic         
## 99%   ( 1.894,  2.913 )  
## Calculations and Intervals on Original Scale
## Some basic intervals may be unstable

The advantage of this method is that we do not have to repeat the experiment, since our sample is likely representative of the whole population of Drosophila. Bootstrapping also makes no assumptions about the distribution of the data, and it can be used with a smaller sample size.
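For intuition, the same resampling can be sketched by hand without the boot package (assuming drosophila$X1.801 is the measurement column, as above):

```r
# Resample with replacement and recompute the 10%-trimmed mean each time
x <- drosophila$X1.801
boot_stats <- replicate(1000, mean(sample(x, replace = TRUE), trim = 0.1))

# Basic 99% interval: reflect the bootstrap percentiles around the observed statistic
obs <- mean(x, trim = 0.1)
2 * obs - quantile(boot_stats, c(0.995, 0.005))
```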

library(plotly)
load("midterm_fly.RData")

plot_ly(x=fly_corn_diet,type = "box", name = "Corn Diet") %>%
  add_trace(x=fly_cotton_diet, name = "Cotton Diet")
par(mfrow=c(2,2))
plot(density(fly_corn_diet))
plot(density(fly_cotton_diet))
invcorn<-1/(fly_corn_diet)
invcotton<-1/(fly_cotton_diet)
plot(density(invcorn))
plot(density(invcotton))

plot_ly(alpha=0.6) %>%
  add_histogram(x=fly_corn_diet, name = "Corn Diet") %>%
  add_histogram(x=fly_cotton_diet, name = "Cotton Diet") %>%
  layout(barmode="overlay")
effect_size = mean(fly_corn_diet)-mean(fly_cotton_diet)
effect_size
## [1] 0.5042262

I will use a nonparametric test because, based on the box plots, the Cotton Diet group has many outliers. It is also difficult to tell what the distribution is; the data do not look normally distributed in the histograms, and there is a relatively small number of samples. Nonparametric tests are robust to outliers and make no assumption about the distribution of the data. The Wilcoxon rank sum test is the best fit, as the two samples are independent. Additionally, I tried a Welch t-test with the inverse-transformed data, because the data (at least, the cotton-diet data) looked more normally distributed after the inverse transformation. I believe the nonparametric test is more accurate for these data, although t-tests are somewhat robust to violations of the normality assumption, so the t-test results may still be reasonably accurate. I used a one-sided Wilcoxon test because it is clear that the corn-fed eye stalks can only be longer, not shorter, than the cotton-diet ones.

wilcox.test(fly_corn_diet, fly_cotton_diet, paired=F, alternative="greater")
## Warning in wilcox.test.default(fly_corn_diet, fly_cotton_diet, paired =
## F, : cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  fly_corn_diet and fly_cotton_diet
## W = 468.5, p-value = 4.429e-07
## alternative hypothesis: true location shift is greater than 0
t.test(invcorn,invcotton,paired=F, alternative="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  invcorn and invcotton
## t = -7.0828, df = 24.141, p-value = 2.455e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2327767 -0.1277524
## sample estimates:
## mean of x mean of y 
## 0.4891166 0.6693812
  1. A p-value, in conditional notation, is Pr{getting a value at least this extreme | H0}: the probability that we would observe a value this extreme (or more extreme, in either direction) given that the null hypothesis is true. The p-value does not tell you the chance that a hypothesis is correct; rather, it tells you how likely the observed effect would be under random chance alone. Thus, you need to estimate the prior plausibility of a hypothesis before you can tell how valuable a p-value is.

For a completely unknown plausibility, that is, a “toss-up,” you can estimate the “strength” of your p-value. In a toss-up situation, if an independent lab tried to replicate an experiment with a p-value of 0.05, you might expect a 29% chance of being unable to replicate the results, i.e. no effect will be seen. With a p-value of 0.01, an independent lab would still have an 11% chance of not being able to replicate the results. However, if it is a “good bet” that the alternative hypothesis is true, then those same p-values, 0.05 and 0.01, give a 4% and a 1% chance, respectively, of failing to see a real effect. Thus, the strength of the p-value depends on how likely your hypothesis was in the first place.
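These replication figures come from combining the prior plausibility of the hypothesis with the test's error rates. A small simulation can illustrate the idea (a sketch with assumed power = 0.8 and α = 0.05, not the exact numbers quoted above):

```r
set.seed(42)
n.sim <- 100000
alpha <- 0.05   # significance threshold (assumed)
power <- 0.8    # probability of detecting a real effect (assumed)
prior <- 0.5    # "toss-up": the alternative hypothesis is true half the time

h1_true <- runif(n.sim) < prior
# A study comes out "significant" with prob = power when H1 is true, alpha when false
significant <- ifelse(h1_true, runif(n.sim) < power, runif(n.sim) < alpha)

# Among significant results, how often was there actually no effect?
mean(!h1_true[significant])
```

Raising the prior toward a "good bet" shrinks this fraction, which is the point of the paragraph above.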

Some important caveats for using p-values: 1) A p-value cannot make statements about reality, only about what the data show. 2) A strong p-value can obscure a negligible effect size; if the effect is negligible, the results are not that interesting, but others may be misled into thinking they are more important than they are. Finally, 3) p-hacking, or dredging through the data until some comparison yields a p-value below 0.05 so that you can publish your results, with no concern for the prior probability of the hypothesis, is a problematic approach. It blurs the line between an exploratory study and a confirmatory study, and is not advised. A p-value obtained by p-hacking is highly likely to be illegitimate and does not reflect a real or interesting effect.